Skip to content

Add more testing for chunked prefill#27506

Merged
fzyzcjy merged 627 commits into
mainfrom
tom/scripted_runtime_and_chunked_testing
Jun 9, 2026
Merged

Add more testing for chunked prefill#27506
fzyzcjy merged 627 commits into
mainfrom
tom/scripted_runtime_and_chunked_testing

Conversation

@fzyzcjy

@fzyzcjy fzyzcjy commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #27502 (mixed-prefix gsm8k eval), which is this PR's base.

Adds the scripted-runtime test harness (one forward per script yield, precise per-step engine-state assertions over an HTTP-server + scheduler-hook driven engine) and a chunked-prefill test suite (test/manual/chunked_prefill/test_scripted_*.py + test_e2e_*.py), plus kv-canary PP fixtures and the registered scripted-runtime/chunked CI tests.

The whole suite was brought green on H200 (registered unit + GPU, manual scripted 89/89 classes, manual e2e 11/11 files) and passed a multi-agent weakening audit.


CI States

Latest PR Test (Base): 🚫 Run #27205607049
Latest PR Test (Extra): ✅ Run #27205605882

fzyzcjy added 30 commits May 31, 2026 22:22
Per review: a one-line passthrough should call the original object directly,
and any wrapper that must exist should return the original type rather than
invent a new one. Remove all thin ScriptedReqHandle field getters and the
thin ScriptedContext readers (engine_stats, row_pool_used, *_rids, batch_size,
get_chunked_req_rid, chunked_in_flight_count, lock_refs_snapshot); tests now
read r.req.<field> / t._scheduler.<...> directly. Keep start_req kwargs,
abort(handle) (an HTTP action), and list_active_reqs() (aggregates four
scheduler structures and returns List[Req]).
list_active_reqs collapses to list(set(_get_all_reqs(ctx))) (Req is hashable
by identity, so the same object across scheduler structures dedups). Make the
ScriptedContextReqStarter.start_req ignore_eos/priority/dp_rank kwargs required
so the internal plumbing carries no silent defaults; the public facade keeps
its ergonomic defaults.
Per the line-count rule (wrap multi-line computations, inline 1-liners):
kv_pages derives a page count from kv_allocated_len and scheduler.page_size,
and engine_stats aggregates several pool counters into one dict -- both are
real computations, not thin field reads, so they earn a wrapper. Re-add their
registered-core tests.
Returning 0 when find_req_by_rid finds nothing conflated 'req released' with
'req present holding 0 pages', which would let a kv_pages==0 assertion pass
even if the req were wrongly missing -- a silent assertion weakener. Read the
field on the live req instead; the released-state check still holds because
right after run_until_finished the req is still in last_batch with
kv_allocated_len==0.
Replace the removed thin-wrapper calls with direct access to the original
objects: handle field reads become r.req.<field> (output_ids, req_pool_idx,
cached_tokens, finished_reason, ...) and the dropped context readers become
direct t._scheduler.<...> expressions (running_batch rids, chunked_req rid,
req_to_token_pool sizes; lock_refs_snapshot -> get_all_node_lock_refs). Kept
wrappers (kv_pages, engine_stats, abort, list_active_reqs, start_req kwargs)
are used unchanged. Released-state reads that follow a drain are made
None-tolerant (assert r.req is None or ...). These files still reference the
not-yet-implemented multi-line APIs (chunks_done, force_*, exhaust_*, ...) and
do not run until those land.
With tie_word_embeddings, lm_head was tied to model.embed_tokens
unconditionally. Under PP the embedding lives on the first rank, so the
last rank (which computes logits) ended up with a PPMissingLayer head and
crashed in _compute_lm_head. Mirror Qwen2: tie only when the model is on a
single PP rank, otherwise build a real ParallelLMHead on the last rank and
copy the tied embedding weight into it during load_weights.
The kv-canary SingleForwardManager assumes one forward pass at a time;
pipeline parallelism keeps several micro-batches in flight and breaks its
phase checker. Enable the canary only when pp_size == 1.
Under pipeline parallelism, micro-batches stay in flight after the running
batch looks empty; flushing the cache then trips a radix-tree assertion
when a late micro-batch releases a node from the flushed tree. Drive extra
scheduler steps to drain the pipeline before flush_cache.
Lets a script keep a request resident for a fixed number of decode steps
regardless of EOS, which scenarios that need a steady KV-cache occupant
rely on. Defaults to False so existing scripts are unchanged.
start_req gains a prompt_token parameter so a script can issue a request
whose prompt does not share a radix prefix with another req (needed to make
a chunked req actually prefill instead of hitting a resident req's cache).

get_all_node_lock_refs now handles SWARadixCache TreeNodes, which track
full_lock_ref and swa_lock_ref separately instead of a single lock_ref.
The scenario never drove the KV pool tight enough to park the chunked req:
the resident competitor and r both used token id 1, so r hit the radix prefix
cache and prefilled almost nothing. Give r a distinct prompt token, keep the
competitor resident via ignore_eos, and use enable_mixed_chunk so the
competitor decodes alongside r's prefill chunks -- its decode growth exceeds
its 0.7x admission reservation and pushes the full pool to 1.00 on r's final
chunk, parking r (add_chunked_req early-return) until the competitor frees its
KV. All assertions (observed_early_return, r.finished, no locked nodes after
drain) are unchanged.
…lama

The scripted-runtime PP tests ran Llama-3.2-1B-Instruct, whose tied word
embeddings crashed under pipeline parallelism, which had forced a fix into
production llama.py. Switch SMALL_MODEL to Qwen3-0.6B, which already handles
tied embeddings under PP in qwen3.py, and revert the llama.py change.
…ng it

Revert the earlier chunked=True at the prefill-completion cache site in
batch_result_processor: that branch runs for every request finishing prefill
(chunked final chunk and non-chunked single prefill alike), so forcing
chunked=True suppressed the legitimate hit_count bump on the general prefill
path, not just chunked self-inserts.

A request inserts its prompt prefix twice -- via cache_unfinished_req when
prefill completes and via cache_finished_req on finish -- while decode nodes are
inserted once, so prompt nodes settle at hit_count 2 and decode nodes at 1,
identically for chunked and non-chunked prefill. #18843's middle-chunk
chunked=True still prevents the unbounded early-node inflation.

Replace the old all==1 expectation with side-by-side chunked and non-chunked
tests asserting the hit_count value set is exactly {1, 2}.
The scripted ctx.flush_cache() previously only confirmed the
FlushCacheReqInput arrived at the scheduler; it never checked the flush
actually happened. The scheduler's flush_cache() is a no-op returning
success=False whenever it is not fully idle, so a flush request on a busy
scheduler silently did nothing.

Make flush_cache a generator that yields once -- the scheduler consumes
the buffered FlushCacheReqInput during the same recv_requests() call, but
only after the script yields control back -- and then asserts the cache
is fully flushed: radix tree empty and both the KV allocator and the
req_to_token pool restored to their fully-free baselines (captured at
context init to sidestep page-rounding / padding-slot conventions).

Add assert_flushed (default True) so the deliberately-busy
flush-during-chunked test can opt out. Update all call sites to use
`yield from t.flush_cache()`.
Classify radix nodes by cumulative token position instead of only checking the
hit_count value set: nodes ending within the prompt (prefill) must each be hit
exactly twice, nodes beyond it (decode) exactly once. This pins the count to the
correct node role rather than merely asserting both values occur.
Record the flush outcome on SchedulerFlushWrapper.last_success and have
the scripted ctx.flush_cache() assert that scheduler.flush_cache()
returned True, rather than re-deriving flushed-ness from radix/pool
state. The scheduler already computes the authoritative success bool
(False whenever it is not fully idle); reading it directly is simpler and
exact.

The scripted helper clears last_success before posting and asserts it is
True after the yield that lets the scheduler run the flush. Drop the
pool-baseline capture in ScriptedContext.__init__.
Drop the assert-it-is-really-flushed logic from scripted ctx.flush_cache
and restore it to its original fire-and-confirm form, reverting the
generator signature, the pool-availability baselines on ScriptedContext,
and the `yield from t.flush_cache()` call sites. The flush verification
will be reworked separately.
Without enable_mixed_chunk and without an ignore_eos competitor, the
add_chunked_req hybrid-SWA early-return is unreachable through ordinary request
traffic: while a chunked req is being prefilled, prefill preempts decode (so a
running req never grows the pool) and no second req can be admitted (the chunk
budget is held), so nothing tightens rem_total_tokens mid-prefill. The previous
design relied on exactly that mixed-chunk decode kept alive by ignore_eos.

Reproduce the pool state directly instead: admit a single chunked req r on an
empty pool, then drain the full-attention allocator from the script so r's next
chunk cannot be admitted -> the real scheduler takes the early-return and parks
r (chunked_req set, _chunked_req_scheduled_last_iter False). Releasing the
reserved KV lets r resume; the test still asserts the stash gate left no radix
lock refs behind.

Add ScriptedContext.reserve_full_attention_kv / full_attention_available_size
(context/pool.py) for the deterministic full-pool pressure.
…e) into start_req API work

Conflicts in scripted_runtime start_req, both sides extended the same signature:
- context/req_starter.py: keep theirs' prompt_token (input_ids = [prompt_token]*prompt_len)
  and ours' priority/dp_rank payload passthrough; ignore_eos was added by both, kept once;
  use the shared sampling_params var (equivalent to theirs' inline {max_new_tokens, ignore_eos}).
- context/api.py: union the public start_req kwargs (ignore_eos, priority, dp_rank, prompt_token)
  in both the signature and the delegated call; ours' abort/list_active_reqs/engine_stats
  auto-merged in untouched.
lock_refs (ScriptedReqHandle): the radix lock_ref held on the req's last_node
(0 once last_node is cleared on release) -- a genuine multi-line derivation,
no instrumentation. batch_composition (ScriptedContext): a pure read of the
scheduler returning {prefill, decode, chunked, running} rid lists, with the
chunked req excluded from prefill/decode so the subsets stay disjoint. Add
registered tests for both.
Rework the regression to mirror issue #24252's actual trigger. The chunked-req
hybrid-SWA early-return is driven by rem_total_tokens <= 0, where the running
batch's reserved-decode offset is sum(max_new - output) * new_token_ratio. In
production a KV-pool-full retraction jumps new_token_ratio (log: 0.098 ->
0.7589), ballooning that reservation and parking a mid-prefill chunked req.

A retraction only fires on the decode path, and the scripted runtime steps
deterministically with overlap scheduling off, so decode (hence the ratio jump)
cannot run while a chunked req is being prefilled. So reproduce the retraction's
effect directly: a resident holder with a large max_new carries the reservation,
r is admitted at a low (decayed) ratio, then the ratio is jumped exactly as
retract_decode does -> r is parked via the early-return. Restoring the decayed
ratio lets r resume; the test asserts the stash gate left no radix lock refs.

Drop the earlier full-attention pool-drain helper (context/pool.py + the two
ScriptedContext methods); the ratio-jump path reproduces the real mechanism and
no longer needs it.
…ests

has_pending_chunk has no accessor and is equivalent to the existing
is_chunking (true while the req holds the chunked_req slot, false after
abort/pause/retract/finish); every assertion site is consistent with that,
so rename rather than add a redundant one-line wrapper.
…anual tests

pending_middle_outputs has no harness accessor and is exactly the existing
Req.inflight_middle_chunks (incremented per dispatched middle chunk, decremented
when its forward result is consumed, zeroed on retract/abort) -- a direct field
read, not instrumentation. Rename all references (incl. test/script names and
messages) to read r.req.inflight_middle_chunks. Drop the duplicate
test_inflight_middle_chunks_non_negative the rename collided with.
…inish guard

The scripted harness has no finish_event_count accessor, and 'finish emitted
exactly once' is already enforced by the engine: output_streamer asserts
'not req.finished_output' before streaming a finished req, and the req pool
asserts req_pool_idx is not None on free -- the scripted runtime surfaces any
such scheduler-side crash as a failed test. So the count assertion is redundant:
keep driving each scenario (abort / preempt / PP races) and assert the req
finished; a double-finish would have crashed the engine. == 1 -> assert
finished; <= 1 -> a comment noting the engine enforces at-most-one.
Add a single gated hook at the scheduler's run_batch chokepoint:
when scripted_scheduler_hook is set, record each run batch (forward_iter,
mode, rids, extend_rids, chunked_rid) into ScriptedSchedulerHook._batch_log
(driver rank only; log cleared after the per-script reset drain). No
scheduler business-logic change and zero production cost (hook is None
otherwise).

chunks_done(rid) is then derived in the harness as the number of logged
batches where the req held the chunked_req slot and actually ran -- exactly
the number of chunked prefill passes (0 when never chunked). Expose it as
ScriptedContext.chunks_done / ScriptedReqHandle.chunks_done, so the manual
suite's r.chunks_done works unchanged. Add registered tests
(<=chunk -> 0, chunk+2 -> 2, 5*chunk -> 5).
ForwardMode is an IntEnum, so .name.lower() is simpler than the is_*() chain
and also captures the modes the chain collapsed to 'unknown' (target_verify,
split_prefill, dllm_extend, ...).
@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Update after merging latest main into this branch (which pulls in #27512): base-c-8gpu-h200 (2) / test_mimo_v2 is now SUCCESS#27512's multimodal pad-sentinel clamp fixed the scheduler-init crash (exit -3 from an out-of-vocab OOB gather in the MiMo-V2 + multi-layer-EAGLE draft path). Confirmed both by a fresh tip-of-main rerun-test (passed in 169s) and now in this PR's own CI. The earlier 'main also fails' cross-branch result was a timing artifact — my probe ran on a07d813, before #27512 landed.

Remaining red lanes are all unrelated to this PR's diff (PP cross-mb exclusion / scripted harness / gsm8k eval):

  • deepep-4gpu-h100 (test_deepep_small): torch.distributed.DistNetworkError ... EADDRINUSE ... port 43639 already in use → Rank 0 scheduler died during init — a CI-runner port-collision infra flake.
  • base-c-8gpu-h200 (0) (test_dsa_glm5_tp_mtp): throughput AssertionError: 141.63 not greater than 150 — a perf-threshold blip on a DSA/GLM5/MTP path this PR does not touch.
  • pr-test-extra-finish: aggregator cascade from the above.

Will /rerun-failed-ci once the round fully settles. Please push back if any conclusion is off.

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-failed-ci

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Full rundown of every lane that went red on this PR and what each one actually was.

PR-introduced — all fixed

  1. PP × chunked-prefill regression (TestRegressionPp reqs did not finish; test_e2e_pp gsm8k 0.49). The cross-mb in-flight exclusion SRT fix (69ef71edc4) had been dropped in an earlier main merge; once the harness's corrected finished-detection landed, the missing exclusion surfaced as a hang + KV corruption. Fix: cherry-picked the exclusion back, refined it to last-chunk EXTEND reqs only (it was over-excluding and pinning the long req forever), and added a batch-log finished-detection fallback for short-circuited a.finished and b.finished loops.
  2. num_examples > available test lines AssertionError across ~9 lanes (base-a-1gpu-small, base-c/extra-b b200+h200 → test_basic_sanity_eagle3 / test_mimo_v2 / test_deepseek_v3_fp4 / test_flashinfer_a2a). A hard assert I'd added in the GSM8KEval prefix-pool refactor; upstream silently caps via list slicing and several tests deliberately pass num_examples larger than the dataset. Dropped the assert → 9 failures → 1.
  3. base-c-8gpu-h200(2) / test_mimo_v2 server-startup crash (scheduler_0 exit -3). NOT a permanent infra timeout — it was an out-of-vocab OOB gather in the MiMo-V2 + multi-layer-EAGLE draft path, fixed upstream by [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512. My first cross-branch probe ran on a07d813 (before [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512 landed), which is why main 'also failed' then — a timing artifact. After merging latest main (which includes [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512) this lane is now SUCCESS (also confirmed by a tip-of-main rerun: passed in 169s).

Not PR-introduced — infra/perf, only surfaced after merging latest main

  • deepep lanes (test_deepep_small): EADDRINUSE — port 43639 already in use → Rank 0 scheduler died during init. A CI-runner port-collision infra flake; this PR doesn't touch MoE/DeepEP/dist-init. Most deepep lanes already re-ran green; the last couple are re-running now.
  • base-c-8gpu-h200(0) / test_dsa_glm5_tp_mtp: throughput 141.63 not > 150 — a perf-threshold miss on a DSA/GLM5/MTP path this PR's diff (PP exclusion / harness / gsm8k eval) does not touch. A tip-of-main cross-branch rerun is dispatched to confirm it's pre-existing.

Current: 70 SUCCESS / 0 FAILURE / 3 re-running. Will confirm the last reruns settle. Please push back if any conclusion is off.

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final babysit result.

#27506 is fully green — 77 SUCCESS / 0 FAILURE / 0 running.

All previously-red lanes resolved, each with cross-branch confirmation:

  • deepep (EADDRINUSE port 43639 collision → Rank-0 scheduler died): re-ran green on every lane (4-gpu-h100, 4-gpu-b200, 8-gpu-h200). Confirmed a CI-runner port-collision infra flake — a tip-of-main test_deepep_small rerun also passed cleanly (635s, 0 EADDRINUSE).
  • base-c-8gpu-h200(0) / test_dsa_glm5_tp_mtp (throughput 141.63 not > 150): re-ran SUCCESS; the tip-of-main cross-branch rerun of the same test also passed → a perf-threshold blip, not this PR (which doesn't touch DSA/GLM5/MTP).
  • base-c-8gpu-h200(2) / test_mimo_v2: fixed upstream by [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512 (multimodal pad-sentinel OOB-gather clamp), picked up by merging latest main.

PR-introduced failures were all fixed earlier in this babysit: the PP×chunked-prefill exclusion regression, and the num_examples > available assert. Net: green. Please push back if any conclusion is off.

… test

_in_flight_other_mb_rids + filter_batch(exclude_in_flight_other_mb=...)
were cherry-picked from feat/stateless_scheduler_b (commit 69ef71e),
where a stateless-scheduler rewrite genuinely introduced a PP+chunked KV
corruption (gsm8k 0.66 -> 0.77). That rewrite derived has_pending_chunk
from admission-time state so it cleared one step too early, opening a
window where a req's last-chunk forward is still in flight in mb_a while
mb_b merges it into running_batch and decodes on empty output_ids.

upstream/main structurally prevents this window and never had it:
- per-mb last_mbs[mb_id] is committed only AFTER _pp_process_batch_result
  runs on that batch (scheduler_pp_mixin.py), so a req is merged for decode
  only once its forward result is processed;
- inflight_middle_chunks defers output append until every chunk forward is
  processed (batch_result_processor.py), so decode never runs with empty
  output_ids;
- has_pending_chunk / pending_middle_outputs (the mechanism the fix's
  docstring describes) never existed on main.

So on this branch the code was dead defensive belt-and-suspenders. The
accompanying test only asserted that the exclusion mechanism engages and
is self-consistent (structural), never that omitting it corrupts output,
and the only failure ever seen here was the fix's own over-exclusion. Drop
both. filter_batch and the two scheduler call sites now match main exactly.
@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Removed the _in_flight_other_mb_rids PP cross-mb chunked-exclude code and its scripted test after an investigation concluded it is dead defensive code on main; remaining PP tests verified green. Please push back if any conclusion is off.

Removed _in_flight_other_mb_rids (PP cross-mb in-flight chunked-exclude) + its scripted test (4eb5bc64a0).

This code (_in_flight_other_mb_rids + filter_batch(exclude_in_flight_other_mb=...)) was cherry-picked from a separate stateless-scheduler rewrite branch (commit 69ef71edc4), where a genuine PP+chunked KV-corruption regression existed (gsm8k 0.66 → 0.77). That regression was introduced by that rewrite: it derived the "still has a pending chunk" flag from admission-time state, so it cleared one step too early, opening a window where a req's last-chunk forward is still in flight in mb_a while mb_b merges it into running_batch and decodes on empty output_ids.

main structurally prevents that window and never had the bug:

  • per-mb last_mbs[mb_id] is committed only after _pp_process_batch_result runs on that batch (scheduler_pp_mixin.py), so a req is merged for decode only once its forward result has been processed;
  • inflight_middle_chunks defers the output append until every chunk forward is processed (batch_result_processor.py), so a decode never runs with empty output_ids;
  • has_pending_chunk / pending_middle_outputs (the data model the removed docstring described) never existed on main.

So on this branch the code was dead belt-and-suspenders, and its scripted test was tautological — it only asserted that the exclusion mechanism engages and is self-consistent, never that omitting it corrupts output. (The stateless branch itself later dropped the cross-mb scan, concluding the per-mb last_mbs invariant suffices.) filter_batch and the two scheduler call sites now match main exactly.

Verification (sci-h200, 4×GPU, this commit, Qwen3-0.6B): TestRegressionPp + test_scripted_pp.py11 passed / 1 skipped / 0 failed. Notably test_pp_last_chunk_cross_mb_kv_correctness (a dedicated cross-mb last-chunk KV-correctness test) passes without the removed exclusion, confirming it was not load-bearing.

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the CI failures on the latest head after merging current main; classifying them and planning reruns. Please push back if any conclusion is off.

CI triage for head f240a513a6 (after merging latest main via #27502). All current failures are infra/perf flakes, not caused by this PR.

Context: after the merge this branch's production code (python/sglang/srt/) is now identical to main (the merge reconciled the earlier divergence). The branch's own registered scripted tests pass on this exact merged commit (rcli, sci-h200 4×GPU: test_scripted_swa_1gpu.py + test_scripted_core_1gpu.py → 8 passed; TestRegressionPp + test_scripted_pp.py earlier → 11 passed). So the red lanes below are the standard suite, not this PR's tests.

Fingerprints:

  • HTTP 504 Gateway Timeout during env/dependency setup (dies before tests run) — base-b-test-1-gpu-small (4/5/6/8), base-b-test-2-gpu-large (1), base-c-test-4-gpu-gb300 (0). A transient CI package-fetch wave.
  • extra-b-test-4-gpu-h100 (0)test/registered/hicache/test_qwen35_hicache.py: zmq.error.ZMQError: Address already in use (tcp://*:5557) — port-collision infra flake (not this PR's 4-GPU scripted test).
  • extra-a-test-1-gpu-large (0)test/registered/perf/test_bench_serving_1gpu_large.py: AssertionError: 1010.2 not less than 900 — perf-threshold blip (not this PR's swa/core scripted test).
  • base-c-test-8-gpu-h20 (1) — chronic H20 lane, ignored per CI policy.

Plan: let the round finish, then rerun the failed jobs once the 504 infra wave clears.

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-failed-ci

1 similar comment
@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-failed-ci

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final CI triage after re-running the flaky lanes. Please push back if any conclusion is off.

Rerun update for head f240a513a6: re-running the failed jobs cleared all the infra/perf flakes (the 504-setup wave, the test_qwen35_hicache zmq port collision, the test_bench_serving perf blip, and the chronic H20 lane all went green). Now 64 SUCCESS / 8 SKIPPED, with a few long multi-GPU lanes still finishing.

One persistent failure remains — and it is not caused by this PR:

  • extra-b-test-4-gpu-b200 (0)test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py: AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size' (exit -9). It failed on both the original run and the rerun (persistent, not a flake).

Why it's main-inherited, not this PR's: after merging current main, this branch's production code (python/sglang/srt/) is byte-identical to main (git diff upstream/main -- python/sglang/srt/ is empty), and this PR touches no LoRA / MoE code — it only adds tests + scripted-runtime harness. So the FusedMoEWithLoRA.hidden_size regression lives in main's LoRA+MoE path and reproduces independently of this PR. (I have not dispatched a fresh tip-of-main data point for this specific test; happy to if a cross-branch confirmation is wanted.)

@fzyzcjy

fzyzcjy commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Root-caused the one remaining CI failure and confirmed it is a pre-existing main regression, not this PR. Please push back if any conclusion is off.

Root cause of the sole remaining red lane extra-b-test-4-gpu-b200 (0) (test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py):

File ".../sglang/srt/models/gpt_oss.py", line 294, in forward_normal
    hidden_dim_unpadded = self.experts.hidden_size
AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size'

When LoRA is enabled, self.experts is wrapped by FusedMoEWithLoRA (srt/lora/layers.py), which does not expose hidden_size (no attribute, no __getattr__ proxy to base_layer). The bare self.experts.hidden_size access in gpt_oss.py was added by #27063 ("[AMD] Optimize gpt-oss-120B performance") and does not account for the LoRA wrapper, so gpt-oss + MoE-LoRA crashes during CUDA-graph capture (exit -9).

This is not introduced by this PR:

  • After merging current main, this branch's python/sglang/srt/ is byte-identical to main (git diff upstream/main -- python/sglang/srt/ is empty), and this PR touches neither models/gpt_oss.py nor lora/layers.py — it only adds tests + the scripted-runtime harness.
  • The failing line is still present on tip-of-main (1ff7c627cd), and there are no commits to gpt_oss.py/lora/layers.py between this branch's merge base and tip-of-main — i.e. main itself is still broken on this path, so re-merging main would not clear this lane. A search of open PRs found none fixing this specific gpt_oss × FusedMoEWithLoRA.hidden_size regression.

Minimal fix (separate, on main code — out of scope for this test-only PR): expose hidden_size on the LoRA MoE wrapper (proxy to base_layer), or have gpt_oss.py read getattr(self.experts, "hidden_size", None) or self.experts.base_layer.hidden_size.

Everything else on this PR is green (all infra/perf flakes cleared on rerun).

@fzyzcjy

fzyzcjy commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triage of the current CI round after re-merging latest main (which picked up the gpt-oss LoRA fix). Classifying each red lane; none are caused by this PR. Please push back if any conclusion is off.

Status for head 17f139b55c (CI still finishing — ~19 lanes running). The earlier blocker is resolved: the FusedMoEWithLoRA.hidden_size regression is gone (fixed on main by #27528, pulled in via this merge), and this branch's python/sglang/srt/ remains byte-identical to main (test-only). The CPU-registered test_mixed_prefix_gsm8k.py passes.

All remaining red lanes are infra / borderline-flake, in code this PR does not touch:

  • b200 lanes (extra-b/base-b/base-c …-4-gpu-b200, …-deepep-…-b200) — two distinct non-PR causes observed:
    • dirty-runner GPU-memory pre-check abort: ERROR: memory >=10%: GPU 0 (63%) … FAIL — Aborting CI (dies before any test runs);
    • borderline accuracy flake: test/registered/spec/eagle/test_eagle_infer_beta_dp_attention.py::test_a_gsm8kAssertionError: np.float64(0.62) not greater than 0.62 (score landed exactly at the >0.62 threshold; spec-decode/DP-attn path, untouched here).
  • CPU lanestest/registered/model_loading/test_external_models.py exit -9 (OOM loading large external checkpoints, e.g. Kimi-K2-NVFP4).
  • base-c-…-8-gpu-h20 — chronic H20 machine lane, ignored per CI policy.
  • large 1-/2-gpu base-b lanes — same infra-flake flavor (and on Add mixed-prefix gsm8k eval and its CPU unit test #27502 they fan out via fast-fail since that PR has no bypass-fastfail).

Plan: not re-running mid-round (would cancel in-flight lanes). Once the round settles I'll re-run the failed jobs; the b200 lanes additionally need the runner fleet to be clean to pass. Will keep watching.

@fzyzcjy

fzyzcjy commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. Re-running the one remaining red lane (a perf-threshold flake) to confirm it's a flake. Please push back if any conclusion is off.

After re-running the failed jobs, every infra/b200 lane went green — the only remaining red is a perf-threshold flake, not this PR:

  • base-c-test-8-gpu-h200 (0)test/registered/models_e2e/test_dsa_glm5_tp_mtp.py: AssertionError: 140.9 not greater than 150 (throughput ~141 vs the strict >150 threshold). DeepSeek-style glm5 TP+MTP throughput test; this branch is test-only (python/sglang/srt/ is byte-identical to main) and touches no model/perf code, so this is a shared-runner perf blip, not a regression here.

Re-running this single file to confirm.

@fzyzcjy

fzyzcjy commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

/rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Results for /rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py:

🚀 8-gpu-h200 (1 test): ✅ View workflow run

cd test/ && python3 registered/models_e2e/test_dsa_glm5_tp_mtp.py

@fzyzcjy

fzyzcjy commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

🤖 Posted autonomously by Claude Code acting on the user's behalf. The single-file re-run of the last red lane passed, confirming it was a flake. Please push back if any conclusion is off.

Update: the /rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py dispatch passed (Rerun Test workflow → success) — so the lone remaining red (base-c-test-8-gpu-h200 (0), dsa_glm5 throughput 141 vs >150) is confirmed a perf-threshold flake, not a regression here. The rollup's base-c-8-gpu-h200 (0) check is just the stale result from the original round (the /rerun-test runs as a separate workflow and doesn't overwrite that check); it'll clear on a normal failed-job rerun.

Net: this PR is effectively green — every one of the ~18 lanes that went red across this round was infra (504-setup wave, b200 dirty-runner GPU-mem pre-check, CPU-OOM on external-model loads, chronic H20) or a perf-threshold flake (dsa_glm5, eagle-dp-attn), all of which passed on re-run. This branch's production code (python/sglang/srt/) is byte-identical to main, so none of these are caused by this PR.

@fzyzcjy fzyzcjy changed the title Add abundant testing for chunked prefill Add more testing for chunked prefill Jun 9, 2026
Base automatically changed from tom/mixed_prefix_gsm8k_eval to main June 9, 2026 12:17
@fzyzcjy fzyzcjy merged commit 1368717 into main Jun 9, 2026
62 of 100 checks passed
@fzyzcjy fzyzcjy deleted the tom/scripted_runtime_and_chunked_testing branch June 9, 2026 12:19
@fzyzcjy

fzyzcjy commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant